Skip to content

[spark] Support nested fields in SparkFilterConverter#8399

Open
Mesut-Doner wants to merge 2 commits into
apache:masterfrom
Mesut-Doner:spark_nested_fields
Open

[spark] Support nested fields in SparkFilterConverter#8399
Mesut-Doner wants to merge 2 commits into
apache:masterfrom
Mesut-Doner:spark_nested_fields

Conversation

@Mesut-Doner

@Mesut-Doner Mesut-Doner commented Jun 30, 2026

Copy link
Copy Markdown

Purpose

Currently, SparkFilterConverter throws an UnsupportedOperationException when it encounters dot-separated nested field paths (e.g. a.b.c). This limits predicate pushdown capabilities in Spark when queries filter on nested Struct types.

This PR implements nested field support in SparkFilterConverter so that Spark V1 Filter objects on nested fields are correctly converted into Paimon Predicate structures:

Nested Schema Resolution: Added getNestedFieldType(...) and resolveField(...) to recursively walk the RowType schema along dot-separated path components to find the correct nested field's DataType.
Transform-based Predicate Conversion: Refactored the converter branches (EqualTo, In, IsNull, IsNotNull, GreaterThan, etc.) to use FieldTransform(FieldRef) and call the corresponding PredicateBuilder methods that accept Transform instead of index-based builders.
Literal Conversion: Updated convertLiteral(...) and convertString(...) to correctly resolve nested field paths and convert literals to their matching leaf data type.

Tests

Added a new test case testNestedField() in SparkFilterConverterTest to verify that nested struct field predicates are successfully converted to Paimon predicates.

@Mesut-Doner Mesut-Doner force-pushed the spark_nested_fields branch from 990ec18 to 85017e3 Compare June 30, 2026 15:39
@Mesut-Doner Mesut-Doner changed the title Spark nested fields [spark] Support nested fields in SparkFilterConverter Jun 30, 2026
}
}

Transform transform = new FieldTransform(new FieldRef(topLevelIndex, field, fieldType));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not actually evaluate the nested field. FieldTransform only reads InternalRowUtils.get(row, fieldRef.index(), fieldRef.type()), so for a filter like a.b = 1 this FieldRef still reads top-level column a (index 0) but with b type. That can make predicate.test(row) and statistics pruning read the struct column as an int/string instead of traversing into b. The new test only checks toString(), so it misses this runtime behavior. Please add a real nested-field transform/path traversal, or keep these filters unsupported until evaluation and stats pruning can handle nested paths correctly.

@JingsongLi JingsongLi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a few correctness/compatibility issues in the nested Spark filter support.

this.index = index;
this.name = name;
this.type = type;
this.nestedIndexes = nestedIndexes;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because FieldRef now carries nestedIndexes / nestedArities, every place that remaps a FieldRef has to preserve them. Existing remappers such as PredicateProjectionConverter, PartitionValuePredicateVisitor, and TableQueryAuthResult still rebuild refs with the 3-arg constructor, so a pushed predicate like a.b = 1 loses its nested path after projection/auth/partition remapping. The resulting FieldTransform then reads top-level a as the leaf type instead of traversing to b, which can produce wrong filtering or runtime failures. Please add a copy/remap helper that preserves the nested metadata and use it for all FieldRef rewrites before enabling nested predicates.

In in = (In) filter;
int index = fieldIndex(in.attribute());
FieldInfo fieldInfo = resolveField(in.attribute());
return builder.in(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes empty IN behavior even for top-level columns. PredicateBuilder.in(int, ...) explicitly handles an empty literal list by creating an In leaf predicate, but PredicateBuilder.in(Transform, ...) falls through to or(empty) and throws. As a result, an empty Spark In filter is no longer converted consistently. Please make the transform overload handle empty lists the same way as the index overload.

}

private FieldInfo resolveField(String field) {
String[] parts = field.split("\\.");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This unconditionally treats any dot as a nested path. Paimon schema validation does not forbid top-level column names containing ., and the previous converter resolved those fields with rowType.getFieldIndex(field). With a top-level column named a.b, this now looks for top-level a and either fails or resolves a different field. Please try an exact top-level field lookup first, and only split as a nested path when no exact field exists.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants